An item sorting heuristic to derive equivalent parallel test versions from multivariate items

Parallel test versions require a comparable degree of difficulty and must capture the same characteristics using different items. This can become challenging when dealing with multivariate items, which are very common in, for example, language or image data. Here, we propose a heuristic to identify and select similar multivariate items for the generation of equivalent parallel test versions. This heuristic includes: 1. inspection of correlations between variables; 2. identification of outlying items; 3. application of a dimension-reduction method, such as principal component analysis (PCA); 4. generation of a biplot (in the case of PCA, of the first two principal components, PCs) and grouping of the displayed items; 5. assignment of the items to parallel test versions; and 6. checking the resulting test versions for multivariate equivalence, parallelism, reliability, and internal consistency. To illustrate the proposed heuristic, we applied it to the items of a picture naming task. From a pool of 116 items, four parallel test versions were derived, each containing 20 items. We found that our heuristic can help to generate parallel test versions that meet the requirements of classical test theory while simultaneously taking several variables into account.
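
Steps 4 and 5 of the heuristic (grouping items that lie close together in the plane of the first two PCs, then spreading each group across the test versions) can be sketched as follows. This is a minimal illustration, not the authors' procedure: the grouping in the paper is described as being done on the biplot, whereas the sketch swaps in an automated greedy nearest-neighbour rule. The function name `assign_parallel_versions`, the item labels, and the PC scores in `example_scores` are all hypothetical.

```python
import math


def assign_parallel_versions(scores, n_versions):
    """Greedy stand-in for visual grouping on a biplot (an assumption).

    scores: dict mapping item name -> (pc1, pc2) score pair.
    Returns (versions, leftover): a list of n_versions item lists, plus
    the items that could not fill a complete group.
    """
    remaining = dict(scores)
    versions = [[] for _ in range(n_versions)]
    group_index = 0
    while len(remaining) >= n_versions:
        # Seed each group with the leftmost remaining item (smallest PC1).
        seed = min(remaining, key=lambda name: (remaining[name], name))
        seed_xy = remaining[seed]
        # The n_versions items closest to the seed form one group of
        # similar items (the seed itself has distance 0).
        group = sorted(
            remaining,
            key=lambda name: (math.dist(seed_xy, remaining[name]), name),
        )[:n_versions]
        # Rotate the starting version so no version always gets the seed.
        for offset, item in enumerate(group):
            versions[(offset + group_index) % n_versions].append(item)
            del remaining[item]
        group_index += 1
    return versions, sorted(remaining)


# Hypothetical PC scores for six items forming three tight pairs.
example_scores = {
    "a": (0.0, 0.0), "b": (0.1, 0.0),
    "c": (2.0, 2.0), "d": (2.1, 2.0),
    "e": (-1.0, 1.0), "f": (-1.1, 1.0),
}
example_versions, example_leftover = assign_parallel_versions(example_scores, 2)
```

With these made-up scores, each pair of neighbouring items is split between the two versions. Any versions produced this way would still need the checks named in step 6.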

Our items and variables, on which these principal components are based, were generated and collected in the past. Among them, Response Time and Accuracy would be the variables most comparable to typical item response variables. These two, along with Image Agreement, form the first principal component, which accounts for 40.79% of the variance in the data. This noteworthy relationship might have escaped our attention had we used the IRT approach.
Given that unidimensionality is considered a prerequisite for IRT and bidimensionality is a prerequisite for our dimension-reduction heuristic, it seems a legitimate question how high the proportion of explained variance should be in order to be considered an indicator of unidimensionality or bidimensionality.
Hattie [36] refers to authors who advocate 20% or 40% as thresholds for unidimensionality. However, in his conclusion, he suggests that it may be unrealistic to search for indications of unidimensionality, and that the test score is basically a weighted composite of all the underlying variables. We share and address this idea by suggesting the use of multivariate items and dimension-reduction procedures.
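
The proportion of explained variance discussed here can be computed directly from the eigenvalues of the correlation matrix. The following is a minimal sketch under stated assumptions: the eigenvalues in `example_eigenvalues` are hypothetical (they are not the values from our data), and the helper names `explained_variance` and `components_needed` are introduced only for illustration.

```python
def explained_variance(eigenvalues):
    """Return each component's share of the total variance."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative share of the
    variance reaches `threshold` (a proportion between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for count, share in enumerate(explained_variance(ordered), start=1):
        cumulative += share
        if cumulative >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues for six standardized variables (they sum to 6).
example_eigenvalues = [2.4, 1.5, 0.9, 0.6, 0.4, 0.2]
example_shares = explained_variance(example_eigenvalues)
```

Whether a given share of the first component, or of the first two components together, is judged sufficient depends on the threshold one adopts; the code only makes that comparison explicit.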
A common criterion for dimension-reduction methods is to retain components until about 70-90% of the variance is explained [21][22][23]. In summary, while IRT usually assumes one latent variable, we assume two principal components from a variety of variables that ideally cover a large portion of the variance in the data. This approach helps to represent items on a two-dimensional graph, in which similarity of items is represented by spatial proximity."
4. The authors do not always specify which correlation coefficients they utilized in their practical application of the proposed heuristic (e.g., in Table 2). This reviewer assumes that they refer to Pearson correlation coefficients. If that is the case, were the items not polytomous and ordinal (ordered categorical), in which case polychoric correlations would have been the more accurate coefficient to apply?
This is an important point and very helpful advice, thank you. Indeed, we used Pearson's correlation coefficients. We did so under the assumption, common in psychometric research, that a Likert scale with at least 5 points may be considered interval-scaled and treated as continuous in the analysis. However, prompted by the Reviewer's comment, we realized that this is justifiably controversial (Allen & Seaman, 2007; Jamieson, 2004; Joshi et al., 2015; Leung, 2011; Sullivan & Artino, 2013; Wu & Leung, 2017). As such, we have computed the polychoric correlations, which yielded the same results as Pearson's correlation coefficients (see the revised Table 2 on page 5).
Note to Table 2. Correlations indicate that PCA is applicable. ObCl = object class, NaAg = naming agreement, ImAg = image agreement, ImCo = image complexity, ACC = accuracy, RT = response time. Significance levels: *** p < 0.001, ** p < 0.01, * p < 0.05. Test statistics based on Pearson's product-moment correlations and polychoric correlation coefficients.
5. The authors indicate that, since the variables show a skewed distribution, it became necessary to perform a logarithmic transformation. Could the authors add a graphic plot (perhaps as an appendix) demonstrating that a logarithmic transformation provided a better fit to the data than alternatives such as a gamma distribution or polynomial function?
Following the Reviewer's suggestion, in the revised manuscript we have added, next to the quantile-quantile plots of Response Time and its logarithmic transformation, a plot of the Box-Cox transformation, which yielded results very similar to the log transformation. This is indicated in the revised Figure 4 and on page 8 of the revised manuscript: "We found logarithmic and box-cox-transformation to deliver a very similar result."
6. In addition to the quantile-quantile plots of the residuals, could the authors also provide findings from statistical tests of normality (e.g.
We thank the Reviewer for spotting this grammatical infelicity. We have replaced 'hold' with 'held' in the revised manuscript (now page 9, line 298).
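
Regarding the comparison of the logarithmic and Box-Cox transformations in the response to comment 5: the Box-Cox family contains the logarithm as its limiting case for a transformation parameter of zero, so similar results are to be expected whenever the estimated parameter is close to zero. The sketch below only illustrates this general relationship. The function name `box_cox`, the parameter value `1e-6`, and the response times in `example_rt_ms` are hypothetical; the parameter estimated for our data is not restated here.

```python
import math


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a positive value x.

    For lam == 0 the transform is defined as the natural logarithm,
    which is the limit of (x**lam - 1) / lam as lam approaches 0.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


# Hypothetical response times in milliseconds (not data from the study).
example_rt_ms = [650.0, 800.0, 1200.0, 2500.0]

# Largest gap between the log transform and Box-Cox with lam near zero.
example_max_gap = max(
    abs(box_cox(rt, 1e-6) - math.log(rt)) for rt in example_rt_ms
)
```

For a parameter further from zero the two transformations diverge, which is why the similarity is reported as an empirical finding rather than assumed.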